On spaced seeds for similarity search
Abstract
Genomics studies routinely depend on similarity searches based on the strategy of finding short seed matches (k contiguous bases), which are then extended. The choice of the seed length, k, is determined by the tradeoff between search speed (larger k reduces chance hits) and sensitivity (smaller k finds weaker similarities). The novel idea of applying a single deterministic, optimized spaced seed to this search process was introduced in [10], where it was empirically demonstrated that the optimal spaced seed quadruples the search speed without sacrificing sensitivity. Multiple randomly spaced patterns, spaced q-grams, and spaced probes were also studied in [5], [4], [3], and in other applications [11, 12]. All were found to be better than their contiguous counterparts. In this paper we study some of the theoretical and practical aspects of optimal seeds. In particular, we demonstrate that the commonly used contiguous seed is in some sense the worst one, and we offer an algorithmic solution to the problem of finding the optimal seed.
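To make the notion of a seed hit concrete, here is a minimal sketch in Python. It is an illustration only, not the algorithm from the paper. It assumes the usual encoding in which a seed is a string of '1' (position must match) and '0' (don't care), and an alignment is a string of '1' (match) and '0' (mismatch) with independent matches of probability p. The function names seed_hits and sensitivity are hypothetical.

```python
from itertools import product


def seed_hits(seed: str, matches: str) -> bool:
    """Return True if `seed` hits the match string `matches`.

    seed:    '1' = position must match, '0' = don't-care position.
    matches: '1' = aligned bases match, '0' = mismatch.
    A contiguous seed of length k is simply '1' * k.
    """
    span = len(seed)
    required = [i for i, c in enumerate(seed) if c == "1"]
    # Slide the seed over every offset where it fits entirely.
    for start in range(len(matches) - span + 1):
        if all(matches[start + i] == "1" for i in required):
            return True
    return False


def sensitivity(seed: str, length: int, p: float) -> float:
    """Exact probability that `seed` hits a random region of `length`
    positions, each matching independently with probability p.

    Brute-force enumeration of all 2**length match strings, so this is
    only practical for small `length` (assumption: illustration only).
    """
    total = 0.0
    for bits in product("01", repeat=length):
        region = "".join(bits)
        if seed_hits(seed, region):
            k = region.count("1")
            total += p**k * (1 - p) ** (length - k)
    return total
```

For a small region one can compare, say, sensitivity("111", 12, 0.7) with sensitivity("1101", 12, 0.7); both seeds have three required positions, so they have the same chance-hit rate, and only the arrangement differs. The enumeration is exponential in the region length, so realistic lengths need a dynamic-programming computation instead.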
Similar resources
SpEED: fast computation of sensitive spaced seeds
SUMMARY Multiple spaced seeds represent the current state-of-the-art for similarity search in bioinformatics, with applications in various areas such as sequence alignment, read mapping, oligonucleotide design, etc. We present SpEED, a software program that computes highly sensitive multiple spaced seeds. SpEED can be several orders of magnitude faster and computes better seeds than the existin...
AcoSeeD: An Ant Colony Optimization for Finding Optimal Spaced Seeds in Biological Sequence Search
Similarity search in biological sequence databases is one of the most popular and important bioinformatics tasks. Spaced seeds have been increasingly used to improve the quality and sensitivity of searching, for example, in seeded alignment methods. Finding optimal spaced seeds is an NP-hard problem. In this study we introduce an application of an Ant Colony Optimization (ACO) algorithm to addres...
Compressed Spaced Suffix Arrays
Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still su...
A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)
We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem – a set of target alignments, an associated probability distribution, and a seed model – that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which ...
Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware
With a sharp increase in available DNA and protein sequence data, new precise and fast similarity search methods are needed for large-scale genome and proteome comparisons. Modern seed-based techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/specificity ratio. We present an implementation of such a seed-based technique on a parallel speciali...
Journal:
- Discrete Applied Mathematics
Volume 138
Publication date: 2004